1.  INTRODUCTION

Several  neurocomputing  paradigms,  generally  known  as  artificial 
neural  networks,  have  been  proposed  in  recent  years.  Some  of  the  most 
popular  are  the  Hopfield  Net  (References  1  and  2),  the  Adaptive  Resonance 
Theory  of  Grossberg  and  Carpenter  (References  3  through  6),  the  Adaptive 
Linear  Combiner  (ADALINE,  MADALINE)  of  Widrow  (References  7  and  8), 
Rumelhart's  Multilayered  Neural  Networks  (References  9  and  10),  and 
variations  of  these.  The  associative  memory  capacity  of  Hopfield-type  nets 
has  been  analyzed  by  several  authors  (References  11  through  15);  the 
capacity  of  multilevel  threshold  functions  was  investigated  (Reference  16). 
In  this  paper,  we  study  the  capacity  of  multilayered  neural  networks. 

An  upper  bound  on  the  number  of  patterns  (input-output  pairs)  that  a 
layered  neural  network  can  learn  is  derived.  The  result  is  obtained  by 
applying  some  results  from  dimension  theory  to  a  set  of  equations  that  the 
input-output  pairs  must  satisfy  for  the  given  architecture.  These  equations 
are  interesting  by  themselves.  They  have  a  dual  interpretation.  For  a  given 
set  of  interconnection  weights,  the  equations  represent  the  transfer  function 
(TF)  between  input  and  output  patterns.  Thus,  if  the  net  is  to  have  a  desired 
TF — that  is,  a  desired  set  of  input-output  pairs  of  patterns — then  the 
interconnection  weights  must  satisfy  the  TF  equations.  In  this  sense,  the 
equations  represent  equations  of  learning. 

In Section 2, the general architecture of our layered neural networks
(LNNs) is presented together with the notation that we shall use to
represent the general TF in closed form. In Section 3, we interpret these
equations as defining a mapping from weight-space to output-space, at which
point the result on the capacity of layered networks follows by a simple
application of a result from dimension theory. To make accessible the
results from dimension theory that we shall use, an introduction is given
in Section 4 that is geared to give the reader a feeling for the concepts
involved. A technical definition is needed to state the results precisely.
Two simple examples that illustrate the concepts introduced in Section 3
and the theoretical results of Section 4 are also included. They show some
of the limitations of the theory and also lead to the discovery of certain
features of the sigmoid function that may affect the performance of an LNN.
These features of the sigmoid functions are stated in the form of
Propositions whose proofs, being highly technical, are relegated to the
Appendix. In Section 5, we point out some of the interesting conclusions
that follow from the results of Section 4. For instance, it will become
clear that reducing the dimensionality of the output patterns increases the
capacity of the net. A couple of theorems on architectures with maximal
capacity are also included in Section 5. The summary and a few conclusions
are the content of Section 6.


2.  LAYERED  NEURAL  NETWORKS  AND  THEIR  TRANSFER  FUNCTION 

An LNN is a network consisting of layers of neurons (processing
elements) connected to each other through weighted connections. The output
$O_i$ of the ith neuron is equal to $S(I_i)$, where $I_i$ denotes its input
[assumed to be a real number ($I_i \in \mathbb{R}$)] and S is a nonlinear
function called a sigmoid function or squashing function. In some cases, S
is a threshold function. The squashing function S is monotonically
increasing, bounded above and below, and usually is differentiable; thus,
its graph looks like that in Figure 1. It also could be piecewise linear.
Figure 2 shows the graph of a threshold function.

Throughout this paper, we shall assume that S is a continuous function
mapping $\mathbb{R}$ into the interval $\mathcal{I} = \{x : -1 \le x \le 1\}$.
Let $\mathcal{I}^n = \{(x_1, \ldots, x_n) : x_i \in \mathcal{I},\ i = 1, 2, \ldots, n\}$.

The input to a neuron is the weighted sum of the outputs of the neurons
in the preceding layer. Thus, between two layers of neurons, layer 1 of size
$m_1$ (a layer with $m_1$ neurons) and layer 2 of size $m_2$, we have an
$m_2 \times m_1$ matrix of weights $W_1$. The input to the ith neuron of
layer 2 equals

    $I_i = \sum_{j=1}^{m_1} w_{ij} O_j$,

where $O_j$ denotes the output of the jth neuron of layer 1, and
$W_1 = (w_{ij})$, $(i = 1, 2, 3, \ldots, m_2;\ j = 1, 2, 3, \ldots, m_1)$.

Let $I = (I_1, I_2, \ldots, I_{m_2})^T \in \mathbb{R}^{m_2}$ denote the vector
of inputs to layer 2 and $O = (O_1, O_2, \ldots, O_{m_1})^T \in \mathcal{I}^{m_1}$
denote the vector of outputs from layer 1 [$(\cdot)^T$ means transpose of
$(\cdot)$]. Then, $I = W_1 O$ and the output vector of layer 2 is $S_{m_2}(I)$,
where the notation $S_n(x)$ means that $x = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^n$
and $S_n(x) = (S(x_1), S(x_2), \ldots, S(x_n))^T$. Thus,

    $S_{m_2}(W_1 O)$    (2.1)

represents the output vector of layer 2 in terms of (a) the output vector O of
layer 1, (b) the weighting matrix $W_1$ between layers 1 and 2, and (c) the
sigmoid function $S_{m_2}$.


Formula 2.1 is the basic building block for the general formula
(Formula 2.2 below) for multilayered networks. One can think of Formula 2.1
as representing the Transfer Function from output of layer 1 to output of
layer 2.

Before presenting the general formula for a network with L layers of
weights, note that the first layer of neurons (layer 1) has an input that will
be the input to the whole network. Denote this input by $I_*$. Then, the output
O of the first layer is simply $O = S_{m_1}(I_*)$. Now, the general formula for
the TF from input $I_*$ to output $O_* \in \mathcal{I}^{m_{L+1}}$ of a neural
network with (L + 1) layers of neurons (L layers of weights) of sizes $m_i$,
$(i = 1, 2, \ldots, L + 1)$ is given by

    $O_* = S_{m_{L+1}}(W_L S_{m_L}(W_{L-1} S_{m_{L-1}}(\cdots S_{m_2}(W_1 S_{m_1}(I_*)) \cdots)))$.    (2.2)

Clearly, the above mapping is a series of compositions of two basic operations:
multiplication by a weighting matrix $W_k$ (a linear transformation) followed
by the sigmoid function $S_{m_{k+1}}$. So, one can say each
$S_{m_k} \circ W_{k-1}$ represents a layer of $m_k$ neurons.

The notation in Equation 2.2 means the following:

(1) $W_k$ is the kth matrix of weights. These are the weights between layer k
of size $m_k$ and layer k+1 of size $m_{k+1}$, $(k = 1, 2, \ldots, L)$.

(2) $W_k S_{m_k}(\cdot) \in \mathbb{R}^{m_{k+1}}$ represents the input to layer
k+1, $(k = 1, 2, \ldots, L)$.

(3) $S_{m_{k+1}}(\cdot) \in \mathcal{I}^{m_{k+1}}$ represents the output of
layer k+1, $(k = 1, 2, \ldots, L)$.

Note  that  the  subindices  of  the  Ss  specify  the  sizes  of  the  different  layers  and 
the  dimensions  of  the  matrices  in  between.  Equation  2.2  defines  precisely  the 
architecture  of  the  multilayered  network. 
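
As an illustration only, Equation 2.2 can be sketched in a few lines of Python.
The choice S = tanh, the function names `layer` and `transfer`, and the sample
weights are assumptions made for this sketch; the text only requires S to be a
continuous map of the reals into [-1, 1].

```python
import math

def layer(W, o, S=math.tanh):
    """Formula 2.1: apply S componentwise to the matrix-vector product W o."""
    return [S(sum(w * x for w, x in zip(row, o))) for row in W]

def transfer(weights, i_star, S=math.tanh):
    """Equation 2.2: T_W(I*) for weight matrices W_1, ..., W_L (lists of rows)."""
    o = [S(x) for x in i_star]      # output of layer 1, S_{m_1}(I*)
    for W in weights:               # W_k carries layer k to layer k+1
        o = layer(W, o, S)
    return o

# Hypothetical architecture m = (2, 3, 1): W_1 is 3x2 and W_2 is 1x3.
W1 = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]
W2 = [[1.0, -2.0, 0.5]]
out = transfer([W1, W2], [0.3, -0.8])
```

The shapes of the matrices alone fix the architecture: each `W` in `weights`
has $m_{k+1}$ rows and $m_k$ columns, which mirrors the role of the subindices
of the S's described above.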

Remark 2.1 - Each matrix of weights $W_i$ has $m_{i+1} \times m_i$ entries
$(i = 1, 2, \ldots, L)$, and each of these entries can be adjusted
independently of the entries of all the other matrices. Thus, the number M of
independent parameters needed to uniquely specify the mapping defined by the
right-hand side (RHS) of Equation 2.2 is given by

    $M = \sum_{i=1}^{L} m_{i+1} \times m_i$.    (2.3)

Let us form an M-dimensional vector out of the M weights needed to uniquely
specify the RHS of Equation 2.2, call this vector W, and let
$T_W : \mathbb{R}^{m_1} \to \mathcal{I}^{m_{L+1}}$ represent the mapping
$I_* \mapsto O_*$ defined by the RHS of Equation 2.2. We have included the
vector W in the above notation to remind us that the mapping $I_* \mapsto O_*$
depends on $W \in \mathbb{R}^M$. With this notation, we can now write
Equation 2.2 as

    $O_* = T_W(I_*)$,  $(I_* \in \mathbb{R}^{m_1})$.    ////

Remark 2.2 - Since the sigmoid function S that we are using maps
zero into zero (Figure 1), the function $S_n$ also will map the zero vector in
$\mathbb{R}^n$ into itself, and consequently so will $T_W$. That is, if
$\theta_n$ represents the zero vector in $\mathbb{R}^n$, then
$T_W(\theta_{m_1}) = \theta_{m_{L+1}}$ no matter what W might be. In some
applications, we might want to map $\theta_{m_1}$ to a nonzero vector in
$\mathbb{R}^{m_{L+1}}$. Then it makes sense to introduce a "shift" in the
functions $S_n$. So, instead of using $S_n$ as defined previously, we could
define $S_n : \mathbb{R}^n \to \mathcal{I}^n$ by

    $S_n(x) = (S(x_1 + \beta_1), S(x_2 + \beta_2), \ldots, S(x_n + \beta_n))^T$,  $\forall x = (x_1, x_2, \ldots, x_n)^T \in \mathbb{R}^n$,    (2.4)

where $\beta = (\beta_1, \beta_2, \ldots, \beta_n)^T$ is a vector of shifts.
Now, if the sigmoid functions in Equation 2.2 are shifted as in Equation 2.4,
then the mapping so defined has more free parameters. The increase in free
parameters is N, where

    $N = \sum_{i=1}^{L+1} m_i$.    (2.5)

Let us form an N-dimensional vector out of the N shift parameters (call this
vector B) and let $T_{W,B} : \mathbb{R}^{m_1} \to \mathcal{I}^{m_{L+1}}$
represent the mapping defined by the RHS of Equation 2.2 when the sigmoid
functions $S_{m_i}$ are shifted by a vector $\beta_i \in \mathbb{R}^{m_i}$,
$(i = 1, 2, 3, \ldots, L+1)$. Now, the mapping $I_* \mapsto O_*$ depends on
$W \in \mathbb{R}^M$ and also on $B \in \mathbb{R}^N$; we can write
Equation 2.2 as

    $O_* = T_{W,B}(I_*)$,  $(I_* \in \mathbb{R}^{m_1})$.

Note that if $B = \theta_N$, then $T_{W,B} = T_W$. We shall refer to W as the
weight vector and B as the shift vector. ////

(The symbol //// indicates the end of a proof, a remark, or an example.)
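
The parameter counts of Remarks 2.1 and 2.2 can be checked with a short Python
sketch. It reads Equation 2.5 as N = m_1 + ... + m_{L+1} (one shift per neuron,
which agrees with N = 2 in Example 4.1); S = tanh and the names
`count_parameters` and `transfer_shifted` are assumptions of this sketch, not
notation from the text.

```python
import math

def count_parameters(m):
    """Return (M, N) for an architecture m = (m_1, ..., m_{L+1})."""
    M = sum(m[i + 1] * m[i] for i in range(len(m) - 1))   # Equation 2.3
    N = sum(m)                                            # Equation 2.5, one shift per neuron
    return M, N

def transfer_shifted(weights, shifts, i_star, S=math.tanh):
    """T_{W,B}(I*): Equation 2.2 with every sigmoid shifted as in Equation 2.4."""
    o = [S(x + b) for x, b in zip(i_star, shifts[0])]
    for W, beta in zip(weights, shifts[1:]):
        o = [S(sum(w * x for w, x in zip(row, o)) + b)
             for row, b in zip(W, beta)]
    return o

M, N = count_parameters((1, 2, 1))
weights = [[[1.0], [2.0]], [[0.5, -0.5]]]       # hypothetical W_1 (2x1), W_2 (1x2)
zero_out = transfer_shifted(weights, [[0.0], [0.0, 0.0], [0.0]], [0.0])
shifted_out = transfer_shifted(weights, [[0.0], [0.0, 0.0], [0.4]], [0.0])
```

With all shifts equal to zero the zero input is mapped to zero, as Remark 2.2
states; a nonzero shift in the last layer moves the image of the zero input
away from zero.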

Remark 2.3 - We did not specify exactly how to form the vector W
from the entries of the matrices $W_i$, $(i = 1, 2, \ldots, L)$, because this
will not be relevant for the analysis in Section 4. Only the dimension of W
will be relevant. Similarly, we shall not specify how to form the vector B;
only its dimension will be of relevance. However, once we decide on a specific
way of forming the weight vector W and the shift vector B, then the integer
L, the (L + 1)-tuple of indices $m = (m_1, m_2, \ldots, m_{L+1})$, and the
vectors W and B uniquely describe a particular layered network. So, we may
refer to a particular layered network as the layered network (L, m, W, B) with
Transfer Function $T_{W,B} : \mathbb{R}^{m_1} \to \mathcal{I}^{m_{L+1}}$. ////


3.  INTERPRETATION OF $T_{W,B}$

For a fixed weight-vector W and a fixed shift-vector B, $T_{W,B}$ defines the
relation between the inputs and outputs of the layered net. Thus, in this
sense, $T_{W,B}$ is a transfer function. In this section, we shall interpret
$T_{W,B}$ a bit differently, however. We shall think of the input pattern
$I_*$ as a pattern to be learned or recognized; the output $O_*$ will be the
desired response to pattern $I_*$. If the weight and shift vectors are such
that Equation 2.2 holds, we will say that the network has learned pattern
$I_*$. So, we will have a collection of input patterns
$J = \{I_* : * = 1, 2, \ldots, k\}$ to be learned and a corresponding
collection of desired responses $\{O_* : * = 1, 2, \ldots, k\}$. The goal is to
determine a set of weights (the learning process) so that Equation 2.2 will
hold, at least approximately, for * = 1, 2, 3, ..., k. In case the sigmoid
functions are shifted, then the goal is to determine a weight-vector W and a
shift-vector B so that

    $T_{W,B}(I_*) = O_*$,  for * = 1, 2, 3, ..., k.    (3.1)

If we think of (W,B) as a point in $\mathbb{R}^M \times \mathbb{R}^N$, then we
can interpret the learning process as a process of finding a point in
$\mathbb{R}^M \times \mathbb{R}^N$ that satisfies (perhaps approximately) the
set of nonlinear equations defined by Equation 3.1.


Thus, for a given set of input patterns J and a given set of desired
responses $\{O_* : * = 1, 2, \ldots, k\}$, Equation 3.1 represents a set of
nonlinear equations in the variable $(W,B) \in \mathbb{R}^M \times \mathbb{R}^N$.
We shall call these equations the Equations of Learning, because these are the
equations that need to be satisfied by (W,B) if the network is to learn the
set of patterns J.

Definition 3.1 - The layered network (L, m, W, B) with TF $T_{W,B}$ (see
Remark 2.3) is said to have learned a set of patterns J perfectly if
Equation 3.1 is satisfied exactly for * = 1, 2, ..., k.

In  the  next  section,  we  derive  an  upper  bound  on  the  number  of 
patterns  that  a  layered  neural  network  can  learn  perfectly. 

To conclude this section, we will introduce the last set of symbols and
notation that will be needed in Section 4. This is done mainly for two reasons:
to emphasize (1) the fact that in the equations of learning the variable is the
point (W,B) in $\mathbb{R}^M \times \mathbb{R}^N$, so the notation will
explicitly indicate this, and (2) that we are dealing with a mapping F from
"(weight x shift)-space" $\mathbb{R}^M \times \mathbb{R}^N$ into
"output-space" $\mathcal{I}^{k \cdot m_{L+1}}$.

For a fixed set of input patterns $J = \{I_* : * = 1, 2, 3, \ldots, k\}$,
define $F : \mathbb{R}^M \times \mathbb{R}^N \to \mathcal{I}^{k \cdot m_{L+1}}$ by

    $F(W, B) = \begin{bmatrix} T_{W,B}(I_1) \\ \vdots \\ T_{W,B}(I_k) \end{bmatrix}$,  $\forall (W, B) \in \mathbb{R}^M \times \mathbb{R}^N$.    (3.2)

Now we can write the system of Equations 3.1 equivalently as

    $F(W, B) = \begin{bmatrix} O_1 \\ \vdots \\ O_k \end{bmatrix}$.

Let $D_J \in \mathcal{I}^{k \cdot m_{L+1}}$ denote the vector of desired
responses for the set of input patterns J. That is, let

    $D_J = \begin{bmatrix} O_1 \\ \vdots \\ O_k \end{bmatrix}$.

Then the equations of learning can succinctly be written as

    $F(W, B) = D_J$.    (3.3)
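
The stacking in Equation 3.2 and the residual of Equation 3.3 can be sketched
as follows. This is a minimal Python illustration under assumptions: S = tanh,
the names `transfer_shifted`, `F` and `residual`, and the sample architecture
and data are all hypothetical.

```python
import math

def transfer_shifted(weights, shifts, i_star, S=math.tanh):
    """T_{W,B}(I*) for weight matrices (lists of rows) and per-layer shift vectors."""
    o = [S(x + b) for x, b in zip(i_star, shifts[0])]
    for W, beta in zip(weights, shifts[1:]):
        o = [S(sum(w * x for w, x in zip(row, o)) + b)
             for row, b in zip(W, beta)]
    return o

def F(weights, shifts, inputs, S=math.tanh):
    """Equation 3.2: stack T_{W,B}(I_1), ..., T_{W,B}(I_k) into one long vector."""
    stacked = []
    for i_star in inputs:
        stacked.extend(transfer_shifted(weights, shifts, i_star, S))
    return stacked

def residual(weights, shifts, inputs, D_J, S=math.tanh):
    """F(W, B) - D_J; the zero vector exactly when Equation 3.3 holds."""
    return [f - d for f, d in zip(F(weights, shifts, inputs, S), D_J)]

# Hypothetical data: architecture m = (1, 2, 2) and k = 3 input patterns.
weights = [[[1.0], [-0.5]], [[0.3, 0.7], [-1.2, 0.4]]]
shifts = [[0.1], [0.0, -0.2], [0.05, 0.0]]
inputs = [[-1.0], [0.0], [2.0]]
D_J = F(weights, shifts, inputs)    # responses that are achievable by construction
```

Here `D_J` has k * m_{L+1} = 6 components, and the residual vanishes because
the desired responses were generated by the network itself.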


Remark 3.1 - As the point (W,B) varies through the (weight x shift)-space
$\mathbb{R}^M \times \mathbb{R}^N$, F(W,B) traces the achievable responses. If
$D_J$ is an achievable response, then, of course, Equation 3.3 has a solution.
The preimage $F^{-1}(D_J)$ represents the set of all combinations of weight
vectors and shift vectors that will solve Equation 3.3. Thus, Equation 3.3 has
a solution if and only if $F^{-1}(D_J)$ is not empty. (Recall that
$F^{-1}(D_J) = \{(W, B) \in \mathbb{R}^M \times \mathbb{R}^N : F(W, B) = D_J\}$.) ////

Remark  3.2  -  The  learning  process  may  be  interpreted  as  a  process  of 
finding  (numerically)  approximate  solutions  to  the  system  of  Equation  3.3. 
Numerical  algorithms  for  solving  systems  of  nonlinear  equations  such  as 
Newton's  Method,  Conjugate  Gradient,  Steepest  Descent  and  others  (References 
17  and  18)  can  be  used  as  'learning  algorithms'  when  applied  to  Equation  3.3. 
Thus,  a  host  of  new  learning  algorithms  are  now  at  our  disposal.  Some  of 
these  may  prove  to  be  faster  and/or  more  efficient  than  the  currently  applied 
learning  algorithms  such  as  the  Delta-Rule  or  Back-Propagation-Error 
(Reference  9).  //// 
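
To make Remark 3.2 concrete, the sketch below applies steepest descent with a
backtracking step to the squared residual of Equation 3.3 for a small scalar
network. It is an assumption-laden illustration, not the method of
References 17 and 18: S = tanh, the gradient is approximated by central
differences, and the names `net`, `loss`, `gradient` and `steepest_descent`
and all numerical values are hypothetical.

```python
import math

def net(p, x):
    """T_{W,B}(x) for a scalar network with parameters p = (W, beta_1, beta_2)."""
    W, b1, b2 = p
    return math.tanh(W * math.tanh(x + b1) + b2)

def loss(p, inputs, desired):
    """Squared norm of F(W, B) - D_J; zero exactly when Equation 3.3 holds."""
    return sum((net(p, x) - d) ** 2 for x, d in zip(inputs, desired))

def gradient(p, inputs, desired, h=1e-6):
    """Central-difference approximation of the gradient of the loss."""
    g = []
    for j in range(len(p)):
        up, dn = list(p), list(p)
        up[j] += h
        dn[j] -= h
        g.append((loss(up, inputs, desired) - loss(dn, inputs, desired)) / (2 * h))
    return g

def steepest_descent(p, inputs, desired, steps=200):
    """Move against the gradient, halving the step until the loss decreases."""
    p = list(p)
    for _ in range(steps):
        g = gradient(p, inputs, desired)
        current, t = loss(p, inputs, desired), 1.0
        while t > 1e-12:
            trial = [pj - t * gj for pj, gj in zip(p, g)]
            if loss(trial, inputs, desired) < current:
                p = trial
                break
            t *= 0.5
    return p

inputs = [-1.0, 0.0, 1.0]
desired = [net((1.5, 0.2, -0.3), x) for x in inputs]   # achievable by construction
start = (0.5, 0.0, 0.0)
found = steepest_descent(start, inputs, desired)
```

The backtracking rule only guarantees that the loss never increases; whether
the iteration reaches an exact solution of Equation 3.3 depends on the problem
and on the starting point.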

4.  CAPACITY  OF  A  LAYERED  NEURAL  NETWORK 

Assume that a set of input patterns J and a vector $D_J$ of desired
responses are given and consider the system of Equation 3.3, the equations of
learning. If k (the number of patterns to be learned) is too large, then the
set of Equation 3.3 will be overdetermined and we may not be able to find a
solution (W,B) in $\mathbb{R}^M \times \mathbb{R}^N$. On the other hand, if k
is small enough, there may be an infinite number of solutions to Equation 3.3.
Exactly how large k may be while we can still hope to solve Equation 3.3 is
the result we seek. We shall use results from dimension theory to obtain an
upper bound on k. Before stating the result formally, we would like to quote a
paragraph from the Introduction in Reference 19, p. 7, which expresses the
main idea in simple terms:

    "Let $f_i(x_1, \ldots, x_n)$, i = 1, ..., m be m continuous real valued
    functions of n real unknowns, or what is the same, m continuous
    real-valued functions of a point in Euclidean n-space. It is one
    of the basic facts of analysis that the system of m equations in n
    unknowns

        $f_i(x_1, \ldots, x_n) = 0$,  i = 1, 2, ..., m

    has, in general, no solution if m > n. The words "in general" may
    be made precise as follows: by modifying the functions $f_i$ very
    little one can obtain new continuous functions $g_i$ such that the
    new system

        $g_i(x_1, \ldots, x_n) = 0$,  i = 1, 2, ..., m

    has no solution. On the other hand, there do exist sets of n
    equations in n unknowns which are solvable, and which remain
    solvable after any sufficiently small modification of their left
    members. This property of Euclidean n-space can be made the
    basis of a general concept of dimension."

Thus,  all  we  need  to  do  is  to  count  the  number  of  scalar  equations  that  we  have 
in  Equation  3.3  and  compare  that  number  with  the  number  of  unknowns. 

Recall that the dimension of the output vectors $O_*$ is $m_{L+1}$; hence, the
dimension of $D_J$ is $k \cdot m_{L+1}$ and, therefore, Equation 3.3 is a
system of $k \cdot m_{L+1}$ scalar equations. The number of unknowns in
Equation 3.3 equals the dimension of W plus the dimension of B; that is,
M + N. So, "in general," the system of Equation 3.3 has no solution unless
$k \cdot m_{L+1} \le M + N$, or

    $k \le \frac{1}{m_{L+1}} (M + N)$,    (4.1)

where M is given by Equation 2.3 and N is given by Equation 2.5.
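
The bound in Inequality 4.1 is simple arithmetic on the layer sizes, as the
following Python sketch shows. The function name `capacity_bound`, the
`shifted` flag, and the sample architectures are assumptions of the sketch; N
is taken as the sum of all layer sizes (one shift per neuron).

```python
from fractions import Fraction

def capacity_bound(m, shifted=True):
    """Right-hand side of Inequality 4.1 for architecture m = (m_1, ..., m_{L+1})."""
    M = sum(m[i + 1] * m[i] for i in range(len(m) - 1))   # number of weights
    N = sum(m) if shifted else 0                          # number of shifts
    return Fraction(M + N, m[-1])                         # unknowns per output component

small = capacity_bound((1, 1))                    # one weight, two shifts, one output
no_shift = capacity_bound((1, 2, 1), shifted=False)
wide_output = capacity_bound((4, 3, 2))
narrow_output = capacity_bound((4, 3, 1))
```

Comparing the hypothetical architectures (4, 3, 2) and (4, 3, 1) illustrates
the effect mentioned in the Introduction: with the same input and hidden
layers, the smaller output layer gives the larger bound.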

Inequality  4.1  is  the  main  result  of  this  work.  Interesting 
consequences  that  follow  from  Inequality  4.1  will  be  discussed  in  Section  5. 
The  theory  needed  to  give  a  rigorous  justification  of  Inequality  4.1  is  highly 
technical  and  it  can  be  found  in  Reference  19.  Here  we  shall  include  the 
minimum  amount  of  theory  needed  to  make  this  work  complete  and  self- 
contained. 

We should point out that the source of the technical complications is in
the statement "the system of m equations in n unknowns
$f_i(x_1, \ldots, x_n) = 0$, (i = 1, ..., m) has, in general, no solution if
m > n." This statement is false without the words "in general." Without these
words, the resulting statement has many exceptions, some of which are very
profound. Therefore, in order to report a correct statement with some degree
of generality, one must be very careful of the wording. The following
definition will describe precisely what we mean by the words "in general" in
the above statement.

Definition 4.1 (Reference 19, p. 74) - Suppose f is a mapping from a
space X into a space Y. A point y of f(X) is called an unstable value of f if
for every positive $\delta$ there is a mapping g from X into Y satisfying

(i)  $d(f(x), g(x)) < \delta$, for every $x \in X$,

(ii) $g(X) \subset Y - \{y\}$.

Other points of f(X) are called stable values of f.

Comments. It is assumed that Y is a metric space and $d(\cdot, \cdot)$ in (i)
is the distance function. The notation f(X) represents the image of X under f;
that is, $f(X) = \{f(x) : x \in X\}$. If A and B are two sets, then
$A - B = \{a \in A : a \text{ is not an element of } B\}$. The symbols
"$\subset$" and "$\in$" have the standard set theoretic meaning.

To  see  the  relevance  of  Definition  4.1  to  our  problem,  consider  the 
equation 


f(x)  =  y.  (4.2) 

If $y \in f(X)$, then Equation 4.2 has a solution $x \in X$. However, if y is
an unstable value of f, then there are arbitrarily small perturbations of f,
like the function g, such that the perturbed equation g(x) = y has no solution
since $y \notin g(X)$. On the other hand, if y is a stable value of f, then
g(x) = y has a solution for all sufficiently small perturbations g of f.
Clearly, we want to avoid having unstable values on the RHS of Equation 3.3.
The next theorem describes cases in which all the values of a function are
unstable.

Theorem 4.1 (Reference 19, p. 75) - Let X be a space of dimension less
than n and f a mapping (continuous function) of X into $\mathcal{I}^n$. Then,
all values of f are unstable.

Fact 4.1 (Reference 19, p. 41) - The dimension of Euclidean n-space
$\mathbb{R}^n$ is n.

Theorem 4.2 - Let $F : \mathbb{R}^M \times \mathbb{R}^N \to \mathcal{I}^{k \cdot m_{L+1}}$
be the mapping defined by Equation 3.2. If Inequality 4.1 is violated, then
every value of F is unstable.

Proof - Let $X = \mathbb{R}^M \times \mathbb{R}^N$; then the dimension of X is
M + N, since it is homeomorphic to $\mathbb{R}^{M+N}$ (Reference 19). If k
exceeds the bound in Inequality 4.1, then the dimension of X is less than
$k \cdot m_{L+1}$. Since F is continuous, every value of F is unstable by
Theorem 4.1. ////

Theorem 4.2 describes the extreme situation in which all the values of
F are unstable. This is certainly an undesirable situation. In this sense, the
RHS of Inequality 4.1 is indeed an upper bound on the number k of patterns
that the layered net can be expected to learn. Note that it does not guarantee
that Equation 3.3 has a solution when k satisfies Inequality 4.1; there may be
other conditions to be satisfied by the input-output pairs other than the
restriction on the number of pairs imposed by Inequality 4.1. Examples 4.1
and 4.2 below illustrate this point. Because of the architecture of the network,
there are certain input-output pairs that are not achievable regardless of the
values of the weights or shifts.

Example 4.1 - Consider, for example, the simplest possible case in
which L = 1, $m_1 = 1 = m_2$, so that the weight-vector W is a scalar and the
shift vector $B = (\beta_1, \beta_2)$ is a 2-vector. The TF is given by

    $T_{W,B}(I) = S(W S(I + \beta_1) + \beta_2)$,  $(I \in \mathbb{R})$.

Since N = 2, M = 1, and $m_{L+1} = m_2 = 1$, k must be less than or equal to 3.
We cannot expect to learn more than three patterns, in general. We shall show
by a simple argument that there are sets of three input-output pairs that are
not achievable by this network; hence, they cannot be learned.

Note that $T_{W,B}$ is a monotone function of the input I. This is because it
is a composition of two sigmoids which are monotone functions. Thus, if we
order the input patterns in an increasing order, say $I_1 < I_2 < I_3$, then
only output patterns that are either increasing $O_1 < O_2 < O_3$ or
decreasing $O_1 > O_2 > O_3$ are achievable by this network. We can see from
this simple example that there may be restrictions on the input-output pairs
other than their cardinality. In this case, if the monotonicity condition is
not satisfied by the three output patterns, then they are not realizable.
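
The monotonicity condition is easy to test before any learning is attempted.
The Python sketch below is an illustration with a hypothetical function name
and hypothetical data; like the text, it considers only strictly increasing or
strictly decreasing outputs and ignores the constant response obtained when
W = 0.

```python
def satisfies_monotonicity(pairs):
    """Check whether outputs are strictly monotone once the inputs are sorted."""
    outputs = [o for _, o in sorted(pairs)]     # order the pairs by increasing input
    steps = list(zip(outputs, outputs[1:]))
    increasing = all(a < b for a, b in steps)
    decreasing = all(a > b for a, b in steps)
    return increasing or decreasing

candidate = [(0.0, -0.5), (1.0, 0.1), (2.0, 0.7)]       # increasing outputs
not_realizable = [(0.0, 0.1), (1.0, 0.7), (2.0, -0.5)]  # outputs go up, then down
ok = satisfies_monotonicity(candidate)
bad = satisfies_monotonicity(not_realizable)
```

Passing this test is only a necessary condition for the network of Example 4.1;
it does not by itself show that the pairs can be learned.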

We are going to go a step further and show that for this network and
certain sigmoids there are sets of three input-output pairs
$\{(I_i, O_i) : i = 1, 2, 3\}$ that satisfy the monotonicity condition and are
not realizable. This also will serve to illustrate the ideas developed in
Section 3 by going through the steps involved in solving the equations of
learning for this simple case. For larger networks, of course, one would use
numerical methods as pointed out in Remark 3.2.

Suppose then that the pairs to be learned satisfy the monotonicity
condition; say $I_1 < I_2 < I_3$ and $O_1 < O_2 < O_3$. We want to find out if
there exist a weight W and two shift parameters $\beta_1$ and $\beta_2$ in
$\mathbb{R}$ such that the equations of learning are satisfied. That is, so
that

    $S(W S(I_i + \beta_1) + \beta_2) = O_i$,  for i = 1, 2, 3.    (4.3)

We have three equations and three free parameters. Since S is invertible, the
first thing we might do to solve this system of equations is to take $S^{-1}$
(the inverse of S) on both sides of Equation 4.3 to obtain

    $W S(I_i + \beta_1) + \beta_2 = S^{-1}(O_i)$,  (i = 1, 2, 3).

Next, let i = 1, solve for W, and substitute the expression for W in the other
two equations. This leads to

    $W = \frac{S^{-1}(O_1) - \beta_2}{S(I_1 + \beta_1)}$    (4.4)

and

    $\frac{S^{-1}(O_1) - \beta_2}{S(I_1 + \beta_1)} S(I_i + \beta_1) + \beta_2 = S^{-1}(O_i)$,  (i = 2, 3).

Now eliminate $\beta_2$ from these two equations and obtain

    $\beta_2 = \frac{S^{-1}(O_2) S(I_1 + \beta_1) - S^{-1}(O_1) S(I_2 + \beta_1)}{S(I_1 + \beta_1) - S(I_2 + \beta_1)}$    (4.5)

and

    $\frac{S^{-1}(O_2) S(I_1 + \beta_1) - S^{-1}(O_1) S(I_2 + \beta_1)}{S(I_1 + \beta_1) - S(I_2 + \beta_1)} = \frac{S^{-1}(O_3) S(I_1 + \beta_1) - S^{-1}(O_1) S(I_3 + \beta_1)}{S(I_1 + \beta_1) - S(I_3 + \beta_1)}$.    (4.6)

After eliminating the denominators and simplifying, Equation 4.6 gives

    $a S(I_1 + \beta_1) + b S(I_2 + \beta_1) + c S(I_3 + \beta_1) = 0$,    (4.7)

where $a = S^{-1}(O_3) - S^{-1}(O_2)$, $b = S^{-1}(O_1) - S^{-1}(O_3)$,
$c = S^{-1}(O_2) - S^{-1}(O_1)$.

Since b = -(a + c), we can write Equation 4.7 as

    $\frac{S(I_3 + \beta_1) - S(I_2 + \beta_1)}{S(I_2 + \beta_1) - S(I_1 + \beta_1)} = \frac{a}{c}$.    (4.8)

Note that since S is monotonically increasing and $O_1 < O_2 < O_3$, a > 0 and
c > 0, so the ratio a/c > 0. Now the question of whether the system of
Equations 4.3 has a solution reduces to the question of whether there is a
shift parameter $\beta_1$ that will solve Equation 4.8. If $\beta_1$ exists,
then Equation 4.5 gives $\beta_2$ and W is given by Equation 4.4 provided
$S(I_1 + \beta_1) \ne 0$. Whether or not Equation 4.8 has a solution depends
on the specific sigmoid function S. Up to this point, for the sake of
generality, the sigmoid function S has been left unspecified. The analysis of
the previous section holds independently of the particular details of the
specific sigmoid function employed as long as it is continuous. Thus, the
results we have so far are valid for all sigmoids, which form a large class
of functions. In fact, they are valid for any continuous function S. However,
the question of whether Equation 4.8 has a solution clearly depends on the
specific sigmoid used. If the sigmoid has compact range (that is, if it
saturates above and below so that the sigmoid is a constant for large and
small values of its argument), then Equation 4.8 always has a solution
(Proposition 4.1). Otherwise, it may not have a solution (Proposition 4.2).
Thus, the shape of the sigmoid is indeed relevant at this point. ////
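
The back-substitution steps of Example 4.1 can be checked numerically. The
Python sketch below assumes S = tanh (with inverse atanh), which is invertible
but is not a sigmoid that saturates; the names `ratio_lhs` and
`back_substitute` and all numbers are hypothetical. It generates three outputs
from known parameters and then recovers $\beta_2$ and W from $\beta_1$ through
Equations 4.5 and 4.4.

```python
import math

S, S_inv = math.tanh, math.atanh     # assumed sigmoid and its inverse

def ratio_lhs(beta1, I):
    """Left-hand side of Equation 4.8 for inputs I = (I_1, I_2, I_3)."""
    s1, s2, s3 = (S(x + beta1) for x in I)
    return (s3 - s2) / (s2 - s1)

def back_substitute(beta1, I, O):
    """Given beta_1, recover (W, beta_2) from Equations 4.5 and 4.4."""
    s1, s2 = S(I[0] + beta1), S(I[1] + beta1)
    a1, a2 = S_inv(O[0]), S_inv(O[1])
    beta2 = (a2 * s1 - a1 * s2) / (s1 - s2)     # Equation 4.5
    W = (a1 - beta2) / s1                       # Equation 4.4, needs s1 != 0
    return W, beta2

# Hypothetical ground truth used to produce realizable outputs.
W_true, beta1_true, beta2_true = 1.5, 0.2, -0.3
I = (-1.0, 0.0, 1.0)
O = [S(W_true * S(x + beta1_true) + beta2_true) for x in I]

a = S_inv(O[2]) - S_inv(O[1])
c = S_inv(O[1]) - S_inv(O[0])
W_rec, beta2_rec = back_substitute(beta1_true, I, O)
```

For these realizable outputs the ratio a/c coincides with the left-hand side of
Equation 4.8 at the true $\beta_1$, and the recovered W and $\beta_2$ match the
values used to generate the data up to rounding error.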

Definition 4.2 - By a sigmoid that saturates we shall mean a
continuous, nondecreasing function $S : \mathbb{R} \to [-1, 1]$ that is onto
and is strictly increasing on $S^{-1}((-1, 1))$. (Recall that
$S^{-1}((-1, 1)) = \{x \in \mathbb{R} : S(x) \in (-1, 1)\}$.)

Proposition 4.1 - If S is a sigmoid that saturates, then Equation 4.8
always has a solution.

We can show (Proposition 4.2 below) that if S is an inverse tangent,
$S(t) = \frac{2}{\pi} \tan^{-1}(t)$, then there are inputs $I_1, I_2, I_3$ so
close to each other that the ratio in the LHS of Equation 4.8 is greater than
some positive number $\eta$ for all $\beta_1 \in \mathbb{R}$. Thus, if the
outputs $O_i$ are such that $a/c < \eta$, then Equation 4.8 has no solution.

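Proposition 4.2 - Let $S(t) = \frac{2}{\pi} \tan^{-1}(t)$, $(t \in \mathbb{R})$.
Let $I_k = k - 1$ for k = 1, 2, 3, and let

    $\gamma(\beta) = \frac{S(I_3 + \beta) - S(I_2 + \beta)}{S(I_2 + \beta) - S(I_1 + \beta)}$,  $(\beta \in \mathbb{R})$.

There exists a positive number $\eta$ such that $\gamma(\beta) \ge \eta$ for
all $\beta \in \mathbb{R}$.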
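
A rough numerical look at the ratio of Proposition 4.2 is sketched below in
Python. It evaluates the ratio on a finite grid of shifts, so it only suggests
the existence of a positive lower bound and does not replace the proof in the
Appendix; the function names `S` and `gamma` and the grid range are
assumptions of the sketch.

```python
import math

def S(t):
    """Inverse-tangent sigmoid of Proposition 4.2, scaled into (-1, 1)."""
    return (2.0 / math.pi) * math.atan(t)

def gamma(beta, I=(0.0, 1.0, 2.0)):
    """Ratio on the LHS of Equation 4.8 with I_k = k - 1 and shift beta."""
    return (S(I[2] + beta) - S(I[1] + beta)) / (S(I[1] + beta) - S(I[0] + beta))

# Sample the shift on a grid from -50 to 50 in steps of 0.01.
grid = [b / 100.0 for b in range(-5000, 5001)]
eta_estimate = min(gamma(b) for b in grid)
```

On this grid the smallest value of the ratio stays well away from zero, which
is consistent with the statement that outputs with a/c below the bound cannot
be matched by any shift.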
These two propositions are proved in the Appendix.

Remark  4.1  -  The  significance  of  Proposition  4.2  is  only  theoretical, 
since  any  hardware  implementation  of  the  sigmoid  would  result  in  a  sigmoid 
that  saturates  above  and  below,  achieving  a  maximum  constant  value  for 
large  arguments  and  a  minimum  constant  value  for  large  negative 
arguments.  Proposition  4.1  says  that  this  saturation  is  a  desirable  feature  of 
the  sigmoid.  //// 


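Example 4.2 - Consider the layered network defined by

    $(L, m, W, B) = (2, (1, 2, 1), (W_1, W_2), \theta_4)$,  with $W_1 = \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$ and $W_2 = [w_3 \ w_4]$,

with TF

    $T_W(I) = S\left([w_3 \ w_4] \, S_2\left(\begin{bmatrix} w_1 \\ w_2 \end{bmatrix} I\right)\right) = S(w_3 S(w_1 I) + w_4 S(w_2 I))$,  $(I \in \mathbb{R})$  (see Remark 2.3).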


Note that the shift vector B is zero in this example and, since the
sigmoid is an odd function (i.e., S(-I) = -S(I) for all I), it is easy to see
that $T_W$ is also an odd function. Thus, the input-output pairs of this
network cannot be all that arbitrary. Regardless of what the weights may be,
there are two constraints that must be satisfied by the input-output pairs of
this system due to the fact that $T_W$ is an odd function:

(i)  $T_W(0) = 0$ and

(ii) $T_W(-I) = -T_W(I)$.


For  this  network.  Inequality  4.1  gives  k  5  4.  Hence  we  cannot  expect  this 
network  to  leant  more  than  four  arbitrary  input-output  pairs;  moreover,  if 
some  pair  violates  (i)  or  (ii),  then  the  network  will  not  learn  it.  So,  again  we 
see  that  Inequality  4.1  gives  an  upper  bound  on  k;  however,  there  is  no 
guarantee  that  Equation  3.3  has  a  solution  when  k  satisfies  Inequality  4.1. 


% 


I 

% 
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We  can  use  this  example  to  illustrate  how  the  opposite  situation  also  can 
occur.  This  is  the  fortuitous  situation  in  which  k  exceeds  the  bound  in 
Inequality  4.1  yet  a  solution  to  Equation  3.3  exists. 

Suppose that the network of Example 4.2 has learned the four input-output pairs $(I_i, O_i)$, $i = 1, 2, 3, 4$, where none of the inputs $I_i$ is zero and no two of them satisfy $I_i = -I_j$ for $i \ne j$. Then, under these conditions, we could "add" five more input-output pairs to the list of pairs to be learned. These are the negatives of the four pairs above, $(-I_i, -O_i)$, $i = 1, 2, 3, 4$, and the pair $(0, 0)$. All nine pairs are realizable, so it would seem that we have violated Inequality 4.1, since now $k = 9 > 4$. However, this is just a fortuitous situation; the extra four pairs $(-I_i, -O_i)$, $i = 1, 2, 3, 4$, do not impose any new conditions on the weights. They are "learned" as a by-product of learning the first four pairs $(I_i, O_i)$, $i = 1, 2, 3, 4$. The pair $(0, 0)$ results from not having shifts in the sigmoids (see Remark 2.2). ////
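The "free" pairs can be illustrated numerically. The following is a minimal sketch, assuming the sigmoid is `tanh` (an odd function chosen here only for illustration; the source does not fix a particular sigmoid) and using arbitrary weight and input values that are not taken from the source.

```python
import math

def transfer(w, x, S=math.tanh):
    """T_W(I) = S(w3*S(w1*I) + w4*S(w2*I)) for the (1, 2, 1) net with zero shifts."""
    w1, w2, w3, w4 = w
    return S(w3 * S(w1 * x) + w4 * S(w2 * x))

weights = (0.7, -1.3, 2.0, 0.4)   # arbitrary illustrative weights (assumption)
inputs = [0.25, 0.9, 1.6, 3.0]    # nonzero inputs, no two of them negatives of each other

# The four pairs the net realizes with these weights.
learned = [(x, transfer(weights, x)) for x in inputs]

# The five extra pairs: negatives of the four pairs above, plus (0, 0).
extra = [(-x, -o) for x, o in learned] + [(0.0, 0.0)]

# Largest mismatch between the net's output and the extra targets.
max_error = max(abs(transfer(weights, x) - o) for x, o in extra)
print(max_error)
```

The mismatch is zero up to floating-point rounding, whatever weights are used, because the extra pairs follow from the odd symmetry of $T_W$ rather than from any additional condition on the weights.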


5.  APPLICATIONS 


Inequality  4.1  together  with  Theorem  4.2  represent  the  main  result  of 
this  work.  An  attempt  was  made  to  illustrate  the  meaning  of  this  result.  This 
was  the  purpose  of  Examples  4.1  and  4.2.  It  is  hoped  that  the  comments, 
remarks,  and  examples  in  Section  4  are  sufficient  to  justify  calling  the 
quantity  on  the  RHS  of  Inequality  4.1  "the  capacity"  of  the  layered  net. 


By Equations 2.3 and 2.5, Inequality 4.1 says

$$k \le \frac{1}{m_{L+1}}\left[\sum_{i=1}^{L} m_{i+1} \times m_i + \sum_{i=1}^{L+1} m_i\right]. \tag{5.1}$$


Definition 5.1 - If an LNN has $L$ layers of weights ($L \ge 1$) and $(L+1)$ layers of neurons with $m_i$ neurons in the $i$th layer, $m_i \ge 1$, $i = 1, 2, 3, \ldots, (L+1)$, then the $(L+1)$-tuple $\mathbf{m} \equiv (m_1, m_2, \ldots, m_{L+1})$ will be called the architecture of the LNN.

Definition 5.2 - If an LNN has architecture $\mathbf{m} \equiv (m_1, m_2, \ldots, m_{L+1})$, then the RHS of Inequality 5.1 will be called the capacity of the LNN and will be denoted by $C(\mathbf{m})$. Thus,

$$C(\mathbf{m}) \equiv \frac{1}{m_{L+1}}\left[\sum_{i=1}^{L} m_i \times m_{i+1} + \sum_{i=1}^{L+1} m_i\right], \qquad (\mathbf{m} \in \mathbb{N}^{L+1}). \tag{5.2}$$

Here, $\mathbb{N}^{L+1}$ represents the collection of $(L+1)$-tuples of positive integers.

In this terminology, Theorem 4.2 says that if the number of input patterns exceeds the capacity of the LNN, then every value of the mapping $F$ defined by Equation 3.2 is unstable.

Now  that  the  notions  of  "architecture"  and  "capacity"  of  an  LNN  have 
been  defined,  we  shall  derive  a  few  simple  results  that  follow  from  Definition 
5.2.  The  first  one,  which  is  perhaps  counterintuitive,  is  the  fact  that  the 
capacity  of  an  LNN  decreases  as  the  number  of  output  neurons  increases.  This 
becomes  evident  after  we  rearrange  terms  and  write  C(m)  as 


$$C(\mathbf{m}) = 1 + m_L + \frac{1}{m_{L+1}}\left[\sum_{i=1}^{L-1} m_{i+1} \times m_i + \sum_{i=1}^{L} m_i\right]. \tag{5.3}$$


Clearly, $C(\mathbf{m})$ is a decreasing function of the number of output neurons $m_{L+1}$. Thus, increasing $m_{L+1}$ can only decrease $C(\mathbf{m})$. In fact, if the total number of neurons is fixed, then $m_{L+1}$ can increase only at the expense of decreasing some of the $m_j$ ($j \le L$), which clearly would reduce $C(\mathbf{m})$ even further. So, we have discovered the following practical result.


Theorem  5.1  -  (a)  Reducing  the  number  of  output  neurons  while 
keeping  the  total  number  of  neurons  fixed  increases  the  capacity  of  the 
layered  network.  (b)  Reducing  the  number  of  output  neurons  while  keeping 
the  number  of  neurons  in  all  the  other  layers  fixed  increases  the  capacity  of 
the  layered  network. 


Note that in part (b) the total number of neurons actually decreases by decreasing the number of output neurons, yet the capacity still increases.

In practice, one may not always be able to choose $m_{L+1}$, the number of neurons in the output layer. However, in those applications where one has the freedom to choose $m_{L+1}$, Theorem 5.1 suggests that it be chosen as small as possible.
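Both parts of Theorem 5.1 can be seen on a small numerical example. The sketch below is only an illustration: the helper `capacity` is a hypothetical name implementing the capacity formula (weights plus neurons, divided by the number of output neurons), and the layer sizes are arbitrary. For part (a), the neurons removed from the output layer are moved to the hidden layer, which is one of several possible ways to keep the total fixed.

```python
from fractions import Fraction

def capacity(m):
    """Capacity of architecture m: (weights + neurons) / (output neurons)."""
    weights = sum(a * b for a, b in zip(m, m[1:]))
    return Fraction(weights + sum(m), m[-1])

# Part (b): shrink the output layer from 4 to 1, other layers fixed at (3, 5).
part_b = [capacity((3, 5, k)) for k in (4, 3, 2, 1)]

# Part (a): shrink the output layer from 4 to 1, total fixed at 12 neurons
# by moving each freed neuron into the hidden layer.
part_a = [capacity((3, 9 - k, k)) for k in (4, 3, 2, 1)]

print([str(c) for c in part_b])
print([str(c) for c in part_a])
```

In both lists the capacity grows strictly as the output layer shrinks, and it grows faster in the fixed-total case, where the freed neurons also add weights elsewhere.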

We  conclude  this  section  with  a  series  of  results  on  maximizing 
architectures:  that  is,  architectures  that  will  maximize  the  capacity  C(m)  of  an 
LNN  when  the  total  number  of  neurons  is  fixed  and  with  a  fixed  number  of 
layers  of  weights. 


The total number of neurons of the LNN will be denoted by $N$ as in Equation 2.5 and, as usual, $L$ will denote the number of layers of weights. For a given $L$, the architecture that maximizes the capacity of the net will be denoted by $\mathbf{m}^*_L$, and $C^*_L \equiv C(\mathbf{m}^*_L)$ is the maximal capacity of the net.


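Theorem 5.2 - Let $m_1 = \alpha$, $m_{L+1} = \beta$, and $N = \sum_{i=1}^{L+1} m_i$ be fixed positive integers.

(a) If $L = 2$ and $N > \alpha + \beta$, then $\mathbf{m}^*_2 = (\alpha,\ N - \alpha - \beta,\ \beta)$ and $C^*_2 = \frac{1}{\beta}\left[(1 + \alpha + \beta)N - (\alpha + \beta)^2\right]$.

(b) If $L = 3$, $N > 2\beta$, $N > 2\alpha$, and $N$ is even, then $\mathbf{m}^*_3 = (\alpha,\ N/2 - \beta,\ N/2 - \alpha,\ \beta)$ and $C^*_3 = \frac{1}{\beta}\left[N + N^2/4 - \alpha\beta\right]$.

(c) If $\alpha > \beta$, $L = 4$, $(N - \beta)$ is even, and $(N - \beta) > 2\alpha$, then $\mathbf{m}^*_4 = \left(\alpha,\ \tfrac{1}{2}(N - \beta) - 1,\ \tfrac{1}{2}(N - \beta) - \alpha,\ 1,\ \beta\right)$ and $C^*_4 = \frac{1}{\beta}\left[N + (N - \beta)^2/4 + (\beta - \alpha)\right]$.

(d) If $\alpha < \beta$, $L = 4$, $(N - \alpha)$ is even, and $(N - \alpha) > 2\beta$, then $\mathbf{m}^*_4 = \left(\alpha,\ 1,\ \tfrac{1}{2}(N - \alpha) - \beta,\ \tfrac{1}{2}(N - \alpha) - 1,\ \beta\right)$ and $C^*_4 = \frac{1}{\beta}\left[N + (N - \alpha)^2/4 + (\alpha - \beta)\right]$.

(e) If $\alpha = \beta$, $L = 4$, $(N - \alpha)$ is even, and $N - 3\alpha \ge 2$, then $\mathbf{m}^*_4 = \left(\alpha,\ \tfrac{1}{2}(N - \alpha) - m_4,\ \tfrac{1}{2}(N - 3\alpha),\ m_4,\ \alpha\right)$, with $1 \le m_4 \le \tfrac{1}{2}(N - \alpha) - 1$, and $C^*_4 = \frac{1}{\beta}\left[N + (N - \alpha)^2/4\right]$.

(f) Under the hypothesis of (b) and (c), or (d), or (e), $C^*_3 > C^*_4$.

(g) $C^*_2 \ge C^*_3$ if, and only if, $4\alpha\beta \ge [2(\alpha + \beta) - N]^2$.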

Proof  -  See  Appendix. 
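Parts (b) and (c) can be spot-checked by exhaustive search over all integer architectures with the prescribed $m_1$, $m_{L+1}$, and $N$. The sketch below is an illustration only: the function names are hypothetical, the parameter values $(N, \alpha, \beta) = (12, 2, 3)$ and $(11, 3, 1)$ are arbitrary choices that satisfy the hypotheses of (b) and (c), and a search over two cases does not replace the proof in the Appendix.

```python
from fractions import Fraction
from itertools import product

def capacity(m):
    """Capacity of architecture m: (weights + neurons) / (output neurons)."""
    weights = sum(a * b for a, b in zip(m, m[1:]))
    return Fraction(weights + sum(m), m[-1])

def best_architectures(L, N, alpha, beta):
    """All maximizers of the capacity with m_1 = alpha, m_{L+1} = beta, sum(m) = N."""
    free = N - alpha - beta                  # neurons left for the L - 1 inner layers
    best, argbest = None, []
    for inner in product(range(1, free + 1), repeat=L - 1):
        if sum(inner) != free:
            continue
        m = (alpha, *inner, beta)
        c = capacity(m)
        if best is None or c > best:
            best, argbest = c, [m]
        elif c == best:
            argbest.append(m)
    return best, argbest

result_b = best_architectures(3, 12, 2, 3)   # part (b): N even, N > 2*alpha, N > 2*beta
result_c = best_architectures(4, 11, 3, 1)   # part (c): alpha > beta, N - beta even and > 2*alpha
print(result_b)
print(result_c)
```

For these values the search returns the single maximizer (2, 3, 4, 3) with capacity 14 for part (b), and the single maximizer (3, 4, 2, 1, 1) with capacity 34 for part (c), in agreement with the closed forms in the theorem.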

Remark 5.1 - If $m_1$ and $m_{L+1}$ are required to equal $\alpha$ and $\beta$, respectively, then, when $L = 2$, there is only one choice for $\mathbf{m}$; thus, $\mathbf{m}^*_2$ is necessarily this unique choice. In this sense, the result of part (a) is trivial. However, it is included so that $C^*_2$ can be compared with $C^*_3$ and $C^*_4$. Part (f) suggests that $L$ be chosen no larger than 3 in situations where it is important to achieve maximal capacity. The results of parts (c) and (d) are interesting, for they show that when $L = 4$ and $\alpha \ne \beta$, maximal capacity requires an architecture with a layer that has only one neuron. This phenomenon persists when $L > 4$. Finally, we can see from part (g) that if $N$ is "small" compared with $2(\alpha + \beta)$ ("small" being defined by the inequality in (g)), then $L = 2$ gives a larger capacity than $L = 3$. ////

As Theorem 5.1 shows, in order to maximize the capacity of an LNN, $m_{L+1}$ should be selected as small as possible. In the next theorem, we are going to set $m_{L+1} = 1$ and let $m_1$ be free, with the total number of neurons fixed. The results that follow will complement those of Theorem 5.2, where $m_1$ was fixed at a given value $\alpha$ and $m_{L+1}$ was fixed at $\beta$. This will tell us what the optimal value of $\alpha$ would be when $\beta = 1$.

Theorem 5.3 - Let $m_{L+1} = 1$ and $N = \sum_{i=1}^{L+1} m_i$ be fixed.

(a) If $L = 2$ and $N$ is even, then $\mathbf{m}^*_2 = \left(\tfrac{1}{2}N - 1,\ \tfrac{1}{2}N,\ 1\right)$ and $C^*_2 = N + N^2/4$.

(b) If $L = 3$ and $N$ is even, then $\mathbf{m}^*_3 = \left(1,\ \tfrac{1}{2}N - 1,\ \tfrac{1}{2}N - 1,\ 1\right)$ and $C^*_3 = N + \tfrac{1}{4}N^2 - 1$.

(c) If $L = 4$ and $N$ is odd, then $\mathbf{m}^*_4 = \left(1,\ \tfrac{1}{2}(N - 1) - m_4,\ \tfrac{1}{2}(N - 3),\ m_4,\ 1\right)$, with $1 \le m_4 \le \tfrac{1}{2}(N - 3)$, and $C^*_4 = \tfrac{1}{4}(N + 1)^2$.

Proof  -  See  Appendix. 

Note that under the conditions of Theorem 5.3, we now have $C^*_2 > C^*_3$ and the maximal capacity decreases as $L$ increases. Also, note that maximal capacity calls for $m_1 = 1$ when $L = 3$ or 4. This phenomenon persists when $L > 4$.
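As a small worked check of parts (a) and (b) of Theorem 5.3, take $N = 8$ (an illustrative value chosen here, not one from the source) and evaluate the capacity, weights plus neurons with $m_{L+1} = 1$, directly on the two maximizing architectures:

```latex
\begin{align*}
\mathbf{m}^*_2 &= (3, 4, 1): &
C^*_2 &= (3\cdot 4 + 4\cdot 1) + (3 + 4 + 1) = 24 = N + \tfrac{1}{4}N^2, \\
\mathbf{m}^*_3 &= (1, 3, 3, 1): &
C^*_3 &= (1\cdot 3 + 3\cdot 3 + 3\cdot 1) + (1 + 3 + 3 + 1) = 23 = N + \tfrac{1}{4}N^2 - 1.
\end{align*}
```

For this $N$ the net with two layers of weights has the larger maximal capacity, and the three-layer maximizer uses a single input neuron.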


6.  SUMMARY  AND  CONCLUSIONS 

The essential components and the architecture of a general (feed-forward) LNN were defined. The equations that must be satisfied by the weights and shifts of the network in order to learn a set of input-output pairs were derived. These form a set of nonlinear equations that were called the equations of learning. If the set of pairs to be learned is known a priori, then one can (in principle) write down these equations and interpret the learning process as a process of solving (numerically) the equations of learning. Thus, a host of new learning algorithms are now at our disposal; namely, all of the known algorithms for solving systems of nonlinear equations (References 17 and 18), which, in this setting, can be interpreted as learning algorithms. Some of these may prove to be faster and/or more efficient than any of the currently applied learning algorithms, such as the Delta-Rule or Back-Propagation-Error (Reference 9), which is one of the most popular methods.

Whether one can exploit the special structure of the equations of learning to improve or adapt some of the existing algorithms for solving systems of equations, so that these algorithms become useful as learning algorithms that might compete against existing learning algorithms, is a topic of further research which will not be addressed here. It is left as an open problem.

By counting the number of degrees of freedom of a layered net (which is equal to the dimension of the (weight × shift)-space of the net) and bounding the number of equations in the equations of learning by this number, an upper bound on the number of distinct input-output pairs that the net can learn was obtained. Example 4.1 together with Proposition 4.1 show that the upper limit can be achieved, making the bound sharp. Two examples to illustrate the result of Theorem 4.2 and its limitations were included.

While analyzing Example 4.1, it was discovered that one can always solve Equation 4.8 if the sigmoid function saturates. It was discovered also that if the sigmoid approaches its asymptotic values too slowly (like an arctangent, for example), then Equation 4.8 may not have a solution if the inputs are too close to each other. Thus, saturation is a desirable feature of the sigmoid. This result is significant, because the simple net of Example 4.1 can be thought of as a building block for larger nets. Whenever a net has two or more layers of neurons with forward connections, one has the simple net of Example 4.1 embedded in it.

Finally,  in  Section  5,  we  defined  the  concepts  of  architecture  and 
capacity  of  an  LNN  and  gave  a  few  results  on  architectures  with  maximal 
capacity.

Appendix

PROOF OF PROPOSITIONS 4.1 AND 4.2 AND THEOREMS 5.2 AND 5.3.

Observation A.1 - If $S: \mathbb{R} \to [-1, 1]$ is a sigmoid that saturates, then there exist two points $t_1$ and $t_2$ such that $t_1 < t_2$, $(-\infty, t_1] = S^{-1}(\{-1\})$, $[t_2, \infty) = S^{-1}(\{+1\})$, and on $[t_1, t_2]$ $S$ is a strictly increasing function with range $[-1, 1]$.

Proposition 4.1 - Let $I_1 < I_2 < I_3$ be three real numbers and $\alpha$ any non-negative number. If $S$ is a sigmoid that saturates, then there exists $\beta \in \mathbb{R}$ such that

$$\frac{S(I_3 + \beta) - S(I_2 + \beta)}{S(I_2 + \beta) - S(I_1 + \beta)} = \alpha. \tag{A.1}$$

Proof - For each $\beta \in \mathbb{R}$, let $g_1(\beta) \equiv S(I_3 + \beta) - S(I_2 + \beta)$ and $g_2(\beta) \equiv S(I_2 + \beta) - S(I_1 + \beta)$. Since $S$ is strictly increasing on $[t_1, t_2]$ and $I_1 < I_2$, we conclude that $g_2(\beta) > 0$ for every $\beta \in (t_1 - I_2,\ t_2 - I_2]$.

If $\mathcal{D} \equiv (t_1 - I_2,\ t_2 - I_2]$ and $f(\beta) \equiv g_1(\beta)/g_2(\beta)$ for every $\beta \in \mathcal{D}$, then $f$ is the ratio of two continuous functions with a denominator that never vanishes; thus, $f$ is continuous on $\mathcal{D}$. Next, we shall show that the range of $f$ is the infinite interval $[0, \infty)$. Since $f$ is continuous, this can be accomplished by showing that $f(t_2 - I_2) = 0$ and

$$\lim_{\beta \to t_1 - I_2} f(\beta) = \infty. \tag{A.2}$$

Recall that $S(\beta) = 1$ for $\beta \ge t_2$ and, since $I_3 > I_2$, $g_1(t_2 - I_2) = 0$. Hence, $f(t_2 - I_2) = 0$. To see that Equation A.2 holds, let $b \equiv g_1(t_1 - I_2)$. Since $I_3 > I_2$ and $S$ is strictly increasing on $[t_1, t_2]$, $b > 0$. (Note that if $I_3 - I_2 + t_1 \ge t_2$, then $b = 2$.) Next, since $g_1$ is continuous, there exists $\varepsilon > 0$ such that $g_1(\beta) > b/2$ for all $\beta \in (t_1 - I_2,\ t_1 - I_2 + \varepsilon)$. Therefore,

$$f(\beta) > \frac{b}{2 g_2(\beta)} \quad \text{for all } \beta \in (t_1 - I_2,\ t_1 - I_2 + \varepsilon). \tag{A.3}$$

Now, since $g_2(t_1 - I_2) = 0$ and $g_2$ is continuous, $g_2(\beta)$ becomes arbitrarily small as $\beta \to (t_1 - I_2)$. Consequently, Inequality A.3 implies Equation A.2.

We have shown that the LHS of Equation A.1, represented by $f(\beta)$, can attain any nonnegative value as $\beta$ varies through the interval $\mathcal{D}$; thus, whenever $\alpha \ge 0$, Equation A.1 has a solution with $\beta \in \mathcal{D}$. ////
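The existence argument above is constructive enough to turn into a bisection search on the interval $(t_1 - I_2,\ t_2 - I_2]$. The sketch below is an illustration under stated assumptions: the saturating sigmoid is taken to be the piecewise-linear clip to $[-1, 1]$ (so $t_1 = -1$, $t_2 = 1$), the inputs $(0, 0.5, 1.2)$ and the target ratio 2 are arbitrary, and the function names are hypothetical. None of these choices come from the source.

```python
def S(t):
    """Piecewise-linear saturating sigmoid: -1 for t <= -1, +1 for t >= 1."""
    return max(-1.0, min(1.0, t))

T1, T2 = -1.0, 1.0   # saturation points t_1, t_2 of this particular S

def ratio(beta, I1, I2, I3):
    """f(beta) = g_1(beta) / g_2(beta), the left-hand side of Equation A.1."""
    g1 = S(I3 + beta) - S(I2 + beta)
    g2 = S(I2 + beta) - S(I1 + beta)
    return g1 / g2

def solve_beta(alpha, I1, I2, I3, iters=200):
    """Bisection on (t_1 - I_2, t_2 - I_2], keeping f(lo) > alpha >= f(hi)."""
    lo, hi = T1 - I2 + 1e-9, T2 - I2      # f(hi) = 0 and f(lo) is very large
    if ratio(lo, I1, I2, I3) <= alpha:
        raise ValueError("alpha is too large for the starting bracket")
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if ratio(mid, I1, I2, I3) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

beta = solve_beta(2.0, 0.0, 0.5, 1.2)
print(beta, ratio(beta, 0.0, 0.5, 1.2))
```

Bisection only needs the continuity of $f$ and the two endpoint values established in the proof, so it does not rely on $f$ being monotone. For the values used here the shift found is $\beta \approx -1.15$.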


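Proposition 4.2 - Let $S(t) = \frac{2}{\pi}\tan^{-1}(t)$, $(t \in \mathbb{R})$. Let $I_k = k - 1$ for $k = 1, 2, 3$ and let

$$\gamma(\beta) \equiv \frac{S(I_3 + \beta) - S(I_2 + \beta)}{S(I_2 + \beta) - S(I_1 + \beta)}, \qquad (\beta \in \mathbb{R}). \tag{A.4}$$

There exists a positive number $\eta$ such that $\gamma(\beta) \ge \eta$ for all $\beta \in \mathbb{R}$.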

Proof - Let $S'$ denote the derivative of $S$. The following function of two variables will be instrumental in the proof.

$$f(x, y) \equiv \frac{S'(x + y)}{S'(x)}, \qquad x \in \mathbb{R},\ y \in [0, 2].$$

Since $S'(x) > 0$ for all $x$, $f(x, y) > 0$ for all $x \in \mathbb{R}$ and $y \in [0, 2]$. Furthermore, $f$ is a continuous function. Thus, if we restrict $f$ to a compact rectangle of the form $[a, b] \times [0, 2]$, where $a < b$, $f$ will attain a minimum value $\lambda > 0$. We shall use this fact later.

By the Mean Value Theorem, for each $\beta \in \mathbb{R}$, we can find two numbers, $\alpha_1(\beta)$ and $\alpha_2(\beta)$, such that

$$\beta < \alpha_1(\beta) < 1 + \beta, \qquad 1 + \beta < \alpha_2(\beta) < 2 + \beta, \tag{A.5}$$

and

$$S(2 + \beta) - S(1 + \beta) = S'(\alpha_2(\beta)), \qquad S(1 + \beta) - S(\beta) = S'(\alpha_1(\beta)). \tag{A.6}$$

Let $\tau(\beta) \equiv \alpha_2(\beta) - \alpha_1(\beta)$, so that $\alpha_2(\beta) = \alpha_1(\beta) + \tau(\beta)$. Then, by Inequalities A.5,

$$\tau(\beta) \in (0, 2), \qquad \forall \beta \in \mathbb{R}. \tag{A.7}$$

Comparing Equations A.4 and A.6 we get

$$\gamma(\beta) = \frac{S'(\alpha_2(\beta))}{S'(\alpha_1(\beta))}, \qquad (\beta \in \mathbb{R}). \tag{A.8}$$

Since $S'(t) = \frac{2}{\pi}\,\frac{1}{1 + t^2}$, $(t \in \mathbb{R})$, a quick calculation using Equation A.8 shows that

$$\gamma(\beta) = \frac{1 + \alpha_1^2(\beta)}{1 + [\alpha_1(\beta) + \tau(\beta)]^2}, \qquad (\beta \in \mathbb{R}),$$

which, for all $\tau(\beta) \in [0, 2]$, approaches 1 as $|\alpha_1(\beta)| \to \infty$. Therefore, since $\beta < \alpha_1(\beta) < \beta + 1$, there exists $\rho > 0$ such that

$$\gamma(\beta) > \tfrac{1}{2}, \quad \text{for } |\beta| > \rho. \tag{A.9}$$

Next, let $\lambda > 0$ be the minimum value of $f(x, y)$ for $x \in [-\rho,\ \rho + 1]$ and $y \in [0, 2]$. By the choice of $\lambda$, Inequality A.5, Statement A.7, and Equation A.8 we conclude

$$\gamma(\beta) \ge \lambda \quad \text{for } |\beta| \le \rho. \tag{A.10}$$

Now, if $\eta \equiv \min\{\lambda, \tfrac{1}{2}\}$, then Inequalities A.9 and A.10 imply that $\gamma(\beta) \ge \eta$ for all $\beta \in \mathbb{R}$. ////
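The lower bound of Proposition 4.2 can be observed numerically by sampling $\gamma(\beta)$ on a grid. The sketch below is an illustration only: the grid range $[-50, 50]$ and step 0.01 are arbitrary choices, and a sampled minimum is an estimate of $\eta$, not a proof of the bound.

```python
import math

def gamma(beta):
    """gamma(beta) of Equation A.4 with I_k = k - 1 and S(t) = (2/pi) * atan(t)."""
    S = lambda t: (2.0 / math.pi) * math.atan(t)
    return (S(2 + beta) - S(1 + beta)) / (S(1 + beta) - S(beta))

# Uniform grid over [-50, 50] with step 0.01 (illustrative choice).
betas = [-50 + 0.01 * i for i in range(10001)]
values = [gamma(b) for b in betas]

eta_estimate = min(values)                       # sampled lower bound on gamma
beta_at_min = betas[values.index(eta_estimate)]  # where the sampled minimum occurs
print(eta_estimate, beta_at_min)
```

On this grid the ratio stays well away from zero, so for the arctangent sigmoid and these inputs Equation A.1 has no solution when the target ratio $\alpha$ is below the bound. This is the contrast with the saturating case of Proposition 4.1, where every nonnegative ratio is attainable.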

Proof of Theorem 5.2 - The expressions for $C^*_2$, $C^*_3$, and $C^*_4$ can be obtained from Equation 5.2 and $\mathbf{m}^*_2$, $\mathbf{m}^*_3$, and $\mathbf{m}^*_4$, respectively, using elementary algebra. This is also true of $\mathbf{m}^*_2$ (see Remark 5.1).

Since $m_1$, $m_{L+1}$, and $N$ are fixed and $N = \sum_{i=1}^{L+1} m_i$, the capacity $C(\mathbf{m})$ can be expressed as a function of $L - 2$ variables when $L \ge 3$. For each $L \ge 3$ and $(m_3, m_4, \ldots, m_L) \in \mathbb{R}^{L-2}$, let

$$f_L(m_3, m_4, \ldots, m_L) \equiv \beta \cdot C\Big(\alpha,\ N - \alpha - \beta - \sum_{i=3}^{L} m_i,\ m_3, \ldots, m_L,\ \beta\Big).$$

Note that this amounts to setting $m_2 = N - \alpha - \beta - \sum_{i=3}^{L} m_i$. It suffices to show that $f_L$ achieves its global maximum when $m_i$ equals the $i$th component of $\mathbf{m}^*_L$ ($i = 3, 4, \ldots, L$; $L = 3, 4$).

If $L = 3$, then $f_3(m_3) = N + [\alpha + m_3][N - \alpha - \beta - m_3] + \beta m_3$, $(m_3 \in \mathbb{R})$. Let $m_3^* = \tfrac{1}{2}N - \alpha$; then a simple calculation gives $f_3(m_3^* + x) = \beta C^*_3 - x^2$, $\forall x \in \mathbb{R}$. Thus, $f_3(m_3^* + x) \le f_3(m_3^*) = \beta C^*_3$, $\forall x \in \mathbb{R}$. This implies that $f_3(m_3^*)$ is the global maximum of $f_3$ and (b) has been established.

If $L = 4$, then $f_4(m_3, m_4) = N + (\alpha + m_3)[N - \alpha - \beta - m_3 - m_4] + m_3 m_4 + \beta m_4$, $(m_3, m_4) \in \mathbb{R}^2$. Let $m_3^* = \tfrac{1}{2}(N - \beta) - \alpha$, $m_4^* = 1$ and assume $\alpha > \beta$. A simple calculation gives

$$f_4(m_3^* + x,\ m_4^* + y) = \beta C^*_4 - [x^2 + (\alpha - \beta)y], \qquad \forall (x, y) \in \mathbb{R}^2.$$

Since $x^2 + (\alpha - \beta)y \ge 0$ for all $x \in \mathbb{R}$ and all $y \ge 0$, we have

$$f_4(m_3^* + x,\ m_4^* + y) \le f_4(m_3^*, m_4^*) = \beta C^*_4, \qquad \forall x \in \mathbb{R},\ y \ge 0. \tag{A.11}$$

Note that since $m_4^* = 1$, it cannot be perturbed by $y < 0$ in our setting. It follows from Inequality A.11 that $f_4(m_3^*, m_4^*)$ is the global maximum of $f_4$, which establishes (c).

Since $m_5\, C(m_1, m_2, m_3, m_4, m_5) = m_1\, C(m_5, m_4, m_3, m_2, m_1)$, (d) follows from (c) by interchanging the roles of $\alpha$ and $\beta$.

If $\alpha = \beta$ and $m_3^* = \tfrac{1}{2}(N - 3\alpha)$, then

$$f_4(m_3^* + x,\ y) = N + (N - \alpha)^2/4 - x^2 = \beta C^*_4 - x^2, \qquad \forall (x, y) \in \mathbb{R}^2.$$

This shows that $f_4$ achieves its global maximum on the set of points $\{(m_3^*, y) : y \in \mathbb{R}\}$. However, in our setting $y$ must be restricted to the allowable values of $m_4$, which are $1 \le m_4 \le \tfrac{1}{2}(N - \alpha) - 1$. The lower bound on $m_4$ is clear. The upper bound is determined by the requirement that $m_2 \ge 1$. Since $m_2 = N - 2\alpha - m_3 - m_4$, when $m_3 = m_3^*$, we have $m_2 = \tfrac{1}{2}(N - \alpha) - m_4$. Hence, $m_2 \ge 1$ implies $m_4 \le \tfrac{1}{2}(N - \alpha) - 1$. Now (e) has been established.
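The "simple calculations" behind parts (b), (c), and (e) are polynomial identities, so they can be spot-checked with exact rational arithmetic. The sketch below is an illustration: the function names and the parameter ranges are arbitrary choices, the identities are checked as algebraic statements (without imposing the integrality hypotheses of the theorem), and agreement on a finite grid is a sanity check rather than a proof.

```python
from fractions import Fraction as F
from itertools import product

def f3(m3, N, a, b):
    """beta * C(alpha, N - alpha - beta - m3, m3, beta)."""
    return N + (a + m3) * (N - a - b - m3) + b * m3

def f4(m3, m4, N, a, b):
    """beta * C(alpha, N - alpha - beta - m3 - m4, m3, m4, beta)."""
    return N + (a + m3) * (N - a - b - m3 - m4) + m3 * m4 + b * m4

checked = 0
grid = product(range(6, 12), range(1, 4), range(1, 4), range(-3, 4), range(-3, 4))
for N, a, b, x, y in grid:
    N, a, b = F(N), F(a), F(b)
    # L = 3: f3(m3* + x) = beta*C3* - x^2, with m3* = N/2 - alpha.
    assert f3(N / 2 - a + x, N, a, b) == N + N**2 / 4 - a * b - x**2
    # L = 4: f4(m3* + x, 1 + y) = beta*C4* - [x^2 + (alpha - beta)*y],
    # with m3* = (N - beta)/2 - alpha; for alpha == beta this is the identity of part (e).
    beta_c4 = N + (N - b)**2 / 4 + (b - a)
    assert f4((N - b) / 2 - a + x, 1 + y, N, a, b) == beta_c4 - (x**2 + (a - b) * y)
    checked += 1
print(checked, "parameter combinations agree")
```

Writing the objective as a maximum value minus a nonnegative term is what makes the global maximum visible without calculus: the subtracted term vanishes only at the claimed optimum (or along the line $x = 0$ when $\alpha = \beta$).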

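Under the hypothesis of part (c), we have $2\left[\frac{\alpha}{\beta} - 1\right] > 0$ and $(N - 2\alpha) - \tfrac{1}{2}\beta > 0$, which imply the following series of inequalities:

$$\begin{aligned}
& 0 < \tfrac{1}{2}\beta\left[N - \tfrac{1}{2}\beta - 2\alpha + 2\tfrac{\alpha}{\beta} - 2\right] \\
\Rightarrow\ & -\tfrac{1}{2}\beta N + \tfrac{1}{4}\beta^2 - \alpha + \beta < -\alpha\beta \\
\Rightarrow\ & \tfrac{1}{4}\left[N^2 - 2\beta N + \beta^2\right] + (\beta - \alpha) < \tfrac{1}{4}N^2 - \alpha\beta \\
\Rightarrow\ & N + (N - \beta)^2/4 + (\beta - \alpha) < N + \tfrac{1}{4}N^2 - \alpha\beta.
\end{aligned}$$

Thus, when $\alpha > \beta$, $C^*_4 < C^*_3$.

Under the hypothesis of part (d), we have $2\left[\frac{\beta}{\alpha} - 1\right] > 0$ and $(N - 2\beta) - \tfrac{1}{2}\alpha > 0$. These give (as above, by interchanging $\alpha$ and $\beta$)

$$N + (N - \alpha)^2/4 + (\alpha - \beta) < N + \tfrac{1}{4}N^2 - \alpha\beta.$$

Under the hypothesis of part (e), we have $2N - 6\alpha \ge 4$; hence,

$$\begin{aligned}
2N - 5\alpha \ge 4 + \alpha > 0 \ &\Rightarrow\ -4\alpha > -2N + \alpha \ \Rightarrow\ -4\alpha^2 > -2\alpha N + \alpha^2 \\
&\Rightarrow\ N^2 - 4\alpha^2 > (N - \alpha)^2 \ \Rightarrow\ N + \tfrac{1}{4}N^2 - \alpha^2 > N + (N - \alpha)^2/4.
\end{aligned}$$

Thus, when $\alpha = \beta$, $C^*_4 < C^*_3$. This establishes (f).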
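Finally, we have

$$\begin{aligned}
& 4\alpha\beta \ge [2(\alpha + \beta) - N]^2 \\
\Leftrightarrow\ & \alpha\beta \ge (\alpha + \beta)^2 - (\alpha + \beta)N + \tfrac{N^2}{4} \\
\Leftrightarrow\ & -(\alpha + \beta)^2 + (\alpha + \beta)N \ge \tfrac{N^2}{4} - \alpha\beta \\
\Leftrightarrow\ & N + (\alpha + \beta)N - (\alpha + \beta)^2 \ge N + \tfrac{N^2}{4} - \alpha\beta \\
\Leftrightarrow\ & C^*_2 \ge C^*_3.
\end{aligned}$$

This completes the proof of Theorem 5.2. ////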


Proof of Theorem 5.3 - We shall use the same technique used in the proof of Theorem 5.2. The expressions for $C^*_2$, $C^*_3$, and $C^*_4$ can be obtained from Equation 5.2 and $\mathbf{m}^*_2$, $\mathbf{m}^*_3$, and $\mathbf{m}^*_4$, respectively, using elementary algebra.

For each $L \ge 2$ and $(m_2, m_3, \ldots, m_L) \in \mathbb{R}^{L-1}$, let

$$f_L(m_2, m_3, \ldots, m_L) \equiv C\Big(N - 1 - \sum_{i=2}^{L} m_i,\ m_2,\ m_3, \ldots, m_L,\ 1\Big).$$

Note that this amounts to setting

$$m_1 = N - 1 - \sum_{i=2}^{L} m_i, \qquad (L \ge 2). \tag{A.12}$$

It will be shown that $f_L$ achieves its global maximum when $m_i$ equals the $i$th component of $\mathbf{m}^*_L$ ($i = 2, 3, \ldots, L$; $L = 2, 3, 4$).

If $L = 2$, then $f_2(m_2) = N + [N - 1 - m_2]\, m_2 + m_2 = N + [N - m_2]\, m_2$. Let $m_2^* = \tfrac{1}{2}N$; then $f_2(m_2^* + x) = N + \tfrac{1}{4}N^2 - x^2 = f_2(m_2^*) - x^2$ $(x \in \mathbb{R})$. Thus, $f_2(m_2^* + x) \le f_2(m_2^*) = C^*_2$, $\forall x \in \mathbb{R}$. This gives (a).

If $L = 3$, then $f_3(m_2, m_3) = N + [N - 1 - m_2]\, m_2 + m_3$. Let $m_2^* = \tfrac{1}{2}N - 1 = m_3^*$; then $f_3(m_2^* + x,\ m_3^* + y) = C^*_3 - x^2 + (x + y)$. Since $m_1 \ge 1$, Equation A.12 implies that $x + y \le 0$. Therefore, $f_3(m_2^* + x,\ m_3^* + y) \le f_3(m_2^*, m_3^*) = C^*_3$ whenever $x + y \le 0$. This shows that $C^*_3$ is the global maximum of $f_3$ under the constraints of Equation A.12 and $m_1 \ge 1$. This gives (b).

If $L = 4$, $m_2^* = \tfrac{1}{2}(N - 1) - m_4$, $m_3^* = \tfrac{1}{2}(N - 3)$, and $m_4$ is arbitrary for the moment, then

$$f_4(m_2^* + x,\ m_3^* + y,\ m_4) = \tfrac{1}{4}[N + 1]^2 - x^2 + m_4(x + y) = C^*_4 - x^2 + m_4(x + y).$$

As in case (b), $m_1 \ge 1$ and Equation A.12 imply $x + y \le 0$. Therefore, $f_4(m_2^* + x,\ m_3^* + y,\ m_4) \le f_4(m_2^*, m_3^*, m_4) = C^*_4$ whenever $(x + y) \le 0$, and for all $m_4 \ge 0$. This shows that $C^*_4$ is the global maximum of $f_4$ under the constraints of Equation A.12 and $m_1 \ge 1$, independently of the value of $m_4$. However, in our setting, we must have $1 \le m_4 \le \tfrac{1}{2}(N - 3)$. The lower bound on $m_4$ is clear. The upper bound is obtained by noticing that $m_2^* = \tfrac{1}{2}(N - 1) - m_4 \ge 1$. This completes the proof of Theorem 5.3. ////
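The one-parameter family of maximizers in part (c) of Theorem 5.3 can be seen by exhaustive search. The sketch below is an illustration only: $N = 9$ is an arbitrary odd value, the function name is hypothetical, and the search covers all integer architectures with $L = 4$, a single output neuron, and $N$ neurons in total.

```python
from itertools import product

def weights_plus_neurons(m):
    """m_{L+1} * C(m): the number of weights plus the number of neurons."""
    return sum(a * b for a, b in zip(m, m[1:])) + sum(m)

N, L = 9, 4

# All architectures (m_1, m_2, m_3, m_4, 1) of positive integers summing to N.
candidates = [
    (*front, 1)
    for front in product(range(1, N), repeat=L)
    if sum(front) == N - 1
]

# With one output neuron the capacity equals weights_plus_neurons(m).
best = max(weights_plus_neurons(m) for m in candidates)
maximizers = sorted(m for m in candidates if weights_plus_neurons(m) == best)
print(best, maximizers)
```

For $N = 9$ the maximal capacity is 25, which equals $\tfrac{1}{4}(N + 1)^2$, and it is attained by exactly three architectures, one for each admissible value $m_4 = 1, 2, 3$, each with $m_1 = 1$ and $m_3 = \tfrac{1}{2}(N - 3) = 3$.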